
This README gives reproducibility information for the cross-sectional analysis of underproduced packages in Debian. 

[Citation will appear here when published]

Reproducing this project requires extensive processing time. Code to be run is marked with ==>. 

Skip to Steps 6 and 7 if you would like to avoid a ground-up rebuild and instead only review the analytical code for the dataset. 

Step 0: Obtain underproduction dataset, bug data, and package data.

	1. This analysis requires the use of data previously published as part of:

	Champion, Kaylea and Benjamin Mako Hill. (2021) "Underproduction: An approach for measuring risk in open source software." 28th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). Preprint: https://arxiv.org/abs/2103.00352. DOI: 10.1109/SANER50967.2021.00043 

	The dataset release is: 

	Champion, Kaylea; Hill, Benjamin Mako, 2021, "Replication data and online supplement for: Underproduction: An Approach for Measuring Risk in Open Source Software", https://doi.org/10.7910/DVN/PUCD2P, Harvard Dataverse, V2, UNF:6:A8MV1fxlZnJtlKI3DnGaRg== [fileUNF]

	From this dataset release, the file needed is:
	inst_all_packages_full_results.tab 

	2. Debian bug reports, available via rsync. Because we used the data from Champion and Hill 2021, we restrict our bug data to the same range as they did (historical through August 7, 2020; approximately 150 GB). For access, see the instructions at https://www.debian.org/Bugs/Access

	3. Debian package information from the Ultimate Debian Database

	a. So that the data is named consistently, and to save repetition, we use a
	script to generate the set of commands that fetch the project-specific 
	files for every project in the project list. The scripts are from Champion and Hill 2021:
		data/analyzeBugs.py
		data/get_Bugs.py
		data/make_analyzeBugs_Caller.py
		data/make_getBugs_Caller.py
		data/make_Namelist.py
		data/make_PSQL_Caller.py

		make_PSQL_Caller.py
			--> uses projectlist.txt 
			<-- generates the callPSQL.txt file

	b. You can then connect to the database and run the queries. There are many ways to do this,
		for example using the psql command-line client:
		psql --host=udd-mirror.debian.net --user=udd-mirror udd --password
		(password: udd-mirror)

		You can then execute the commands in callPSQL.txt to generate local project-specific files. 
		The files will have names like: public.source.all_bugs.project_IS_accerciser.txt
	c. Move the public.source.all_bugs.* files into the project_data directory.

	This will give you a list of bug IDs for each project.

	d. Obtain the changelogs for each project (these are parsed in Step 4).

Step 1: Set aside just the bugs we need for the project

The project-specific files have bug IDs in them. The next stage fetches all the bugs for all the projects.

	a. This process is also controlled by a script-generating script. Run make_getBugs_Caller.py

		==> make_getBugs_Caller.py
			--> uses the project files in project_data/
			<-- generates the callGetBugs.txt file
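
	The script-generating pattern can be sketched roughly as follows. This is not
	the code of make_getBugs_Caller.py itself: the function names are hypothetical,
	and the assumption that get_Bugs.py takes a project file as its only argument
	is ours for illustration.

```python
import shlex
from pathlib import Path


def make_caller_lines(project_dir, script="get_Bugs.py"):
    """Return one shell command per project file found in project_dir."""
    lines = []
    # Sort so the generated file is stable from run to run.
    for path in sorted(Path(project_dir).glob("public.source.all_bugs.*")):
        # Assumption: the fetch script takes the project file as its only argument.
        lines.append(f"python3 {script} {shlex.quote(str(path))}")
    return lines


def write_caller(project_dir, out_file="callGetBugs.txt"):
    """Write the generated commands to out_file, one per line."""
    lines = make_caller_lines(project_dir)
    Path(out_file).write_text("\n".join(lines) + "\n")
    return len(lines)
```

	Writing one command per line is what makes the output easy to hand to a tool
	such as GNU Parallel.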

	b. Execute the lines in callGetBugs.txt, which are repeated calls to the get_Bugs.py script.
		I used GNU Parallel like this: parallel < callGetBugs.txt

		get_Bugs.py
			--> uses the public.source.all_bugs.* files in the project_data directory
			<-- generates ######.mbox files in the mbox_files directory


Step 2: Parse the bugs into a format that supports network analysis


	Part A: canonicalize names

	Please note that we have not included our aliases list, to avoid causing harm (spam) to participants. If you
	need our original file, please contact us.

	The #####.mbox files are dumps from the bug-tracking system, which structures bug updates as e-mail 
	messages. The next stage is to parse these mbox files and determine a social network from them.
	This requires two pieces of information -- nodes and edges. In this case, each unique individual
	is a node, and working together on a bug means there's an edge between you. Since a person might
	use multiple e-mail addresses, we attempt to generate a list of aliases and canonical names. 
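
	As a rough illustration of the first parsing step, the sketch below collects
	the (name, address) pairs found in the From headers of a set of messages. It
	uses only the Python standard library; the function name is hypothetical and
	this is not the project's parser.

```python
from email import message_from_string
from email.utils import parseaddr


def participants(messages):
    """Return the set of (name, address) pairs seen in From headers."""
    seen = set()
    for msg in messages:
        name, address = parseaddr(msg.get("From", ""))
        if address:
            # Lower-case the address so trivial case differences do not split a person.
            seen.add((name.strip(), address.lower()))
    return seen
```

	An mbox file opened with the standard library's mailbox.mbox yields message
	objects when iterated, so it could be passed to participants() directly.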

	We have two challenges: first, the same person may do work using many different names and e-mail addresses, 
	and second, different people may use the same name. Inspection of bug reports revealed that it is indeed quite 
	common for close variations of the same name and several e-mail addresses to belong to the same person 
	-- they might report a bug with their corporate e-mail address, comment on it with a personal 
	e-mail address, and then ultimately close the bug with a fix using their Debian address (all while 
	using the same distinctive name). 

	The name canonicalization logic we used runs like this: 
	- if two addresses have the same name and work on the same bug, we call them the same person, yielding a set of associations
	- once we have a set of associations within a given bug, we can work through that set of associations to find new associations across bugs, 
	and clean up chains of aliases (i.e. where a --> b and b --> c and c --> d, what we want instead is a --> d, b --> d, and c --> d, where d 
	is the canonical name)
	- @debian.org addresses are the best choice for a 'canonical' name for someone, although not everyone will have one
	- people do a range of creative things with their names and email addresses (for example, a custom e-mail address for each bug, 
	e.g. janesmith-bug12345@iamjanesmith.net); manual inspection is useful to solve this but the file will never be perfect.
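
	The steps above can be sketched as follows. This is our illustrative reading
	of the logic, not the code in fromBugsMakeAliasfile.py or cleanAliasfile.py:
	the function names and the (bug_id, name, address) record layout are
	hypothetical, and the chain cleanup is done here with a union-find structure.

```python
from collections import defaultdict


def within_bug_associations(records):
    """records: iterable of (bug_id, name, address).
    Pair up addresses that share a name on the same bug."""
    by_bug_and_name = defaultdict(set)
    for bug_id, name, address in records:
        by_bug_and_name[(bug_id, name)].add(address)
    pairs = []
    for addresses in by_bug_and_name.values():
        ordered = sorted(addresses)
        # Link every address to the first one; chains are flattened later.
        pairs.extend((other, ordered[0]) for other in ordered[1:])
    return pairs


def canonical_map(pairs):
    """Flatten alias chains so each address maps directly to one canonical address."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for address in list(parent):
        groups[find(address)].append(address)

    result = {}
    for members in groups.values():
        # Prefer an @debian.org address as canonical; otherwise take the first alphabetically.
        debian = sorted(m for m in members if m.endswith("@debian.org"))
        canon = debian[0] if debian else min(members)
        for m in members:
            result[m] = canon
    return result
```

	Addresses that never share a name and a bug with another address do not appear
	in the resulting map; they are already their own canonical form.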

	This approach has a key limitation: if more than one person calling themselves John Smith worked on the same bug, we will conflate them. 
	This is a trade-off; at least we won't believe that somehow Aelphaba Higginbottom is three different people just because she's used a 
	@debian, @gmail, and @employer address in the course of solving a bug.

	==> customize StepMakeNetwork/config.yaml to match your environment
	==> StepMakeNetwork/fromBugsMakeAliasfile.py (initial associations)
	==> StepMakeNetwork/cleanAliasfile.py (cleanup routines; do your manual inspection after this)
	==> StepMakeNetwork/fromBugsMakeNetwork.py (builds networks; they'll be saved in a location you specify in your config.yaml)

	Although we use co-working on a bug to seek out aliases, our network is based on co-working on a package.
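
	To make the package-level definition concrete, here is a minimal sketch of
	deriving edges from package co-working. The (package, person) record layout
	and the function name are hypothetical; fromBugsMakeNetwork.py may build its
	networks differently.

```python
from collections import defaultdict
from itertools import combinations


def package_edges(records):
    """records: iterable of (package, person).
    Return the set of undirected edges between people who worked on the same package."""
    people_by_package = defaultdict(set)
    for package, person in records:
        people_by_package[package].add(person)
    edges = set()
    for people in people_by_package.values():
        # sorted() gives each pair a stable (a, b) order so duplicates collapse.
        edges.update(combinations(sorted(people), 2))
    return edges
```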

	==> StepMakeNetwork/standalone.R (check the required libs, customize globals at the top of the file; this takes a long time to run due to the betweenness calculation!)

Step 3: Use the project-level data to identify project language
	Using the main configuration file in collabnetXS/config.yaml, we parse the dump from the Ultimate Debian Database to identify package tags in Python, then clean it up in R. This is also where the code for calculating mean language age lives, in case you see a need to change how I calculate that.

	==> StepFigureOutLanguage/languageDataNoPandas.py ## why does this say NoPandas when it uses Pandas??
	==> StepFigureOutLanguage/languageCleanup.R ## library is changing support for across(), may need code update in future

Step 4: Parse uploads data to identify maintainers, uploads, and the duration
	The goal here is to parse fairly predictable free text. The script tries multiple times to find a date at the end of the file (examining the last 5 lines); if it doesn't find anything, it bails out.
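
	The "look at the last 5 lines" idea can be sketched like this. It is not the
	code in parseRelease.py: the function name is hypothetical, and we assume the
	date sits in a standard Debian changelog trailer line of the form
	" -- Name <email>  Date".

```python
import re
from email.utils import parsedate_to_datetime

# Assumption: trailer lines follow the Debian changelog convention,
# " -- Name <email>  Date", with two spaces before the date.
TRAILER = re.compile(r"^ -- .*>  (?P<date>.+)$")


def last_trailer_date(text, tail=5):
    """Look for a parseable trailer date in the last `tail` lines, bottom up."""
    for line in reversed(text.splitlines()[-tail:]):
        match = TRAILER.match(line)
        if not match:
            continue
        try:
            return parsedate_to_datetime(match.group("date").strip())
        except (TypeError, ValueError):
            continue  # unparseable date; try the next line up
    return None  # the caller should report this file for manual extraction
```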

	==> StepTrueBirth/parseRelease.py  > errors.txt ## The script will give some errors where manual intervention is needed, so check the errors file and fix.

	Errors like this:
	
		Date parse weirdness when trying to understand  Mon, 17 Jul 2017 10:14:08 -5000 in raw_data/changelogs/metadata.ftp-master.debian.org/changelogs/main/j/jaraco.itertools/unstable_changelog

		...need to be resolved in a consistent way (my approach is to treat bogus offsets as having a misplaced 0, i.e. 5000 should be 0500)
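
		A small helper along these lines could apply the misplaced-0 rule
		consistently. It is a sketch only: the function name is hypothetical, and
		it rewrites only offsets of the form X000 whose hour field is out of range.

```python
import re
from email.utils import parsedate_to_datetime

OFFSET = re.compile(r"([+-])(\d{4})\s*$")


def repair_offset(date_string):
    """Rewrite a trailing offset such as -5000 as -0500; leave valid offsets alone."""
    match = OFFSET.search(date_string)
    if not match:
        return date_string
    sign, digits = match.groups()
    # Real UTC offsets run from -1200 to +1400, so an hour field above 14 is bogus.
    if int(digits[:2]) > 14 and digits[1:] == "000":
        digits = "0" + digits[0] + "00"
    return date_string[: match.start()] + sign + digits
```

		The repaired string can then be parsed with parsedate_to_datetime as usual.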

	Errors like this: 

		python3.10/site-packages/dateutil/parser/_parser.py:1207: UnknownTimezoneWarning: tzname MET identified but not understood.

		Process: edit the data file changelogDates.tsv and replace the NA for the associated package

		MET DST is +02:00
		MST is -07:00
		missing times are 00:00:00+00:00
		ambiguous dates are read American style, i.e. 5/9/95 is May 9th 1995
		if a date is missing altogether, step backward until a date is provided
		Any dates that claim to be in the 1970s or 1980s also need manual repair.
		(Essentially, these timezones are not supported in the library, so we resolve them by looking up the timezone online to see what should be in the library but isn't.)
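
		If you would rather compute the replacement values than type them by hand,
		the rules listed above can be sketched as below. The function names and the
		input date format string are assumptions for illustration; the offsets are
		the ones given in the list, and the result still has to be pasted into
		changelogDates.tsv manually.

```python
from datetime import datetime, timedelta, timezone

# Offsets taken from the list above; extend as new abbreviations turn up.
TZ_OFFSETS = {
    "MET DST": timedelta(hours=2),
    "MST": timedelta(hours=-7),
}


def parse_with_named_zone(stamp, fmt="%a, %d %b %Y %H:%M:%S"):
    """Parse 'Mon, 17 Jul 1995 10:14:08 MST'-style strings using TZ_OFFSETS."""
    # Try longer names first so "MET DST" wins over any shorter suffix.
    for name in sorted(TZ_OFFSETS, key=len, reverse=True):
        if stamp.endswith(" " + name):
            naive = datetime.strptime(stamp[: -len(name)].strip(), fmt)
            return naive.replace(tzinfo=timezone(TZ_OFFSETS[name]))
    return None  # unknown abbreviation: look it up and add it to TZ_OFFSETS


def parse_american_date(stamp):
    """Read 5/9/95 as May 9th 1995, with the missing time set to 00:00:00+00:00."""
    return datetime.strptime(stamp, "%m/%d/%y").replace(tzinfo=timezone.utc)
```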

	Errors like this:

		No date at the end of raw_data/changelogs/metadata.ftp-master.debian.org/changelogs/main/x/xarclock/unstable_changelog -- do a manual extraction.

		...are resolved by manually reading the file and looking at the date, then replacing the NA with the date.


Step 6: Assemble dataset

	lib-00-utils.R ## this is a helper library for working with R and LaTeX
	==> StepMakeDataset/prepData.R


Step 7: Run Analysis and Generate Figures
	==> standalone.R
